WEIGHTED REGRESSION ANALYSIS
AND INTERVAL ESTIMATORS


Abstract. A method is presented for deriving the weighted least squares
estimators for the parameters of a multiple regression model. Confidence
intervals for expected values and prediction intervals for the means of
future samples are given.

Weighted regression analysis is applicable to many forestry problems.
Cunia (1964) discusses weighted regression analysis in detail and
analyzes the relationship between volume and diameter in black spruce.
Many elementary statistics books cover weighted regression analysis, but
they generally include little or no discussion of interval estimators.
The purpose of this paper is to discuss (1) confidence intervals for
expected values and (2) prediction intervals for means of future samples
when the parameters of a multiple regression model are estimated by
weighted least squares.

THE REGRESSION MODEL

Suppose we have a sample of n individuals. The model for the ith
observation is

$$y_i = x_{1i}\beta_1 + x_{2i}\beta_2 + \dots + x_{pi}\beta_p + e_i$$

or

$$y_i = x_i\beta + e_i. \tag{1}$$

The set of n observations can be succinctly written

$$y = X\beta + e \tag{2}$$

where $y$ is the $n \times 1$ vector of observations, $X$ is the
$n \times p$ design matrix, $\beta$ is a $p \times 1$ vector of unknown
constants, and $e$ is the $n \times 1$ vector of errors. It is also
assumed that the errors are normally distributed with mean and
variance-covariance matrix

$$E(e) = 0$$

$$\Sigma_e = E(ee') = V\sigma^2$$

where $V$ is a known positive definite matrix. $\sigma^2$ is a positive
scalar and is assumed to be unknown. In many cases $V$ is a function of
$X$.

The form of $V$ depends on the variances and covariances of the
observations. Suppose that the errors have unequal variances and are
mutually independent. In this case, the variance-covariance matrix of
the errors is

$$V\sigma^2 = \begin{bmatrix} v_1 & & & 0 \\ & v_2 & & \\ & & \ddots & \\ 0 & & & v_n \end{bmatrix}\sigma^2.$$

$V$ can be written in the form $V = P'P = PP = P^2$. Since $V$ is
diagonal we have $P = V^{1/2}$ and $P^{-1} = V^{-1/2}$.

WEIGHTED LEAST SQUARES ESTIMATORS

Weighted least squares estimators can be obtained by transforming the
original observations to variables that satisfy the assumptions for
ordinary least squares. Premultiplying equation (2) by $P^{-1}$ gives

$$P^{-1}y = P^{-1}X\beta + P^{-1}e$$

or

$$Z = Q\beta + f. \tag{3}$$

Equation (3) is an ordinary multiple regression model; that is,

$$E(f) = 0 \quad\text{and}\quad E(ff') = I\sigma^2.$$
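
The two moments of the transformed errors follow from $f = P^{-1}e$ and
the symmetry of $P$ (implied by $V = P'P = PP$). The short derivation
below is a restatement using only the definitions already given, not an
additional result:

```latex
\begin{aligned}
E(f)   &= P^{-1}E(e) = 0, \\
E(ff') &= P^{-1}E(ee')P^{-1\prime}
        = P^{-1}(V\sigma^2)P^{-1}
        = P^{-1}PPP^{-1}\sigma^2
        = I\sigma^2 .
\end{aligned}
```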

Unweighted least squares theory can be applied directly to the
transformed model. The sum of squares of the transformed errors is

$$S = f'f = e'V^{-1}e.$$

Since $V$ is diagonal, the sum of squares can be written as

$$S = \sum_{i} v_i^{-1}e_i^2.$$

Hence $S$ is the weighted sum of squares of the errors.

Weighted least squares estimates are obtained by minimizing $S$ with
respect to $b$, where $b$ is the solution vector corresponding to the
parameter vector $\beta$. The weighted normal equations are

$$Q'Qb = Q'Z$$

or

$$X'V^{-1}Xb = X'V^{-1}y.$$

Solving the weighted normal equations for $b$ gives the weighted least
squares estimator. The solution is

$$b = (Q'Q)^{-1}Q'Z = (X'V^{-1}X)^{-1}X'V^{-1}y.$$

The sum of squares of the transformed residuals is

$$SSR = Z'Z - b'Q'Z = y'V^{-1}y - y'V^{-1}X(X'V^{-1}X)^{-1}X'V^{-1}y.$$

The sample variance is

$$s^2 = SSR/(n-p)$$

which is an unbiased estimator of $\sigma^2$. The variance-covariance
matrix of $b$ is

$$\Sigma_b = (Q'Q)^{-1}\sigma^2 = (X'V^{-1}X)^{-1}\sigma^2.$$

The sample variance-covariance matrix of $b$ is

$$\hat{\Sigma}_b = (Q'Q)^{-1}s^2 = (X'V^{-1}X)^{-1}s^2.$$

Given $V$, values of $b$, $s^2$, and $\hat{\Sigma}_b$ are easily
obtained by ordinary least squares analysis on the transformed variables
$z$ and $q$.
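
The equivalence between the two routes (ordinary least squares on the
transformed variables versus solving the weighted normal equations
directly) can be checked numerically. The following is a minimal sketch
in plain Python, assuming a diagonal V; the data, the function names,
and the small Gaussian elimination helper are hypothetical and serve
only as illustration.

```python
import math

def solve(A, c):
    """Solve the square system A b = c by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [A[i][:] + [c[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= factor * M[col][j]
    b = [0.0] * n
    for i in reversed(range(n)):
        b[i] = (M[i][n] - sum(M[i][j] * b[j] for j in range(i + 1, n))) / M[i][i]
    return b

def wls_by_transformation(X, y, v):
    """Transform by P^{-1} = diag(v_i^{-1/2}), then apply ordinary least squares."""
    n, p = len(X), len(X[0])
    Q = [[X[i][j] / math.sqrt(v[i]) for j in range(p)] for i in range(n)]
    Z = [y[i] / math.sqrt(v[i]) for i in range(n)]
    QtQ = [[sum(Q[i][a] * Q[i][c] for i in range(n)) for c in range(p)] for a in range(p)]
    QtZ = [sum(Q[i][a] * Z[i] for i in range(n)) for a in range(p)]
    b = solve(QtQ, QtZ)
    ssr = sum(z * z for z in Z) - sum(b[a] * QtZ[a] for a in range(p))  # Z'Z - b'Q'Z
    return b, ssr / (n - p)

def wls_direct(X, y, v):
    """Solve the weighted normal equations X'V^{-1}X b = X'V^{-1}y directly."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][c] / v[i] for i in range(n)) for c in range(p)] for a in range(p)]
    rhs = [sum(X[i][a] * y[i] / v[i] for i in range(n)) for a in range(p)]
    return solve(A, rhs)

# Hypothetical data: roughly y = 2 + 3d, with error variance proportional to d^2
d = [1.0, 2.0, 3.0, 4.0, 5.0]
X = [[1.0, di] for di in d]
y = [5.1, 7.8, 11.3, 13.6, 17.4]
v = [di ** 2 for di in d]

b, s2 = wls_by_transformation(X, y, v)
b_direct = wls_direct(X, y, v)
print("b (transformed OLS):", b)
print("b (weighted normal equations):", b_direct)
print("s^2:", s2)
```

Both routes should give the same coefficient vector up to rounding
error, which is the practical point of the transformation: any ordinary
regression routine can be used once the data are divided by the square
roots of the variance factors.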

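PREDICTED VALUES AND INTERVAL ESTIMATORS

Consider a future sample of k independent observations
$(y_1^0, x_1^0), (y_2^0, x_2^0), \dots, (y_k^0, x_k^0)$. The average of
the future values of the dependent variable is

$$\bar{y}^0 = \sum_{j=1}^{k} y_j^0 / k.$$

The statistic $\bar{y}^0$ is normally distributed with mean and variance

$$E(\bar{y}^0) = \bar{x}^0\beta$$

and

$$\sigma^2(\bar{y}^0) = \sum_{j=1}^{k} v_j\sigma^2/k^2 = \bar{v}\sigma^2/k,$$

where $\bar{v} = \sum_{j=1}^{k} v_j/k$ is the average of the variance
factors of the future observations. The sample estimator of
$\sigma^2(\bar{y}^0)$ is

$$s^2(\bar{y}^0) = \bar{v}s^2/k.$$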

Now consider the predicted values of the dependent variable based on the
regression estimates. The prediction for the value of the jth future
observation is

$$y_j^* = x_j^0 b.$$

The average of the regression estimates is

$$\bar{y}^* = (1/k)\sum_{j=1}^{k} y_j^* = \bar{x}^0 b$$

where $\bar{x}^0 = (1/k)\sum_{j=1}^{k} x_j^0$ is the average of the
vectors $x_j^0$.

The statistic $\bar{y}^*$ is normally distributed with mean and variance

$$E(\bar{y}^*) = \bar{x}^0\beta$$

and

$$\sigma^2(\bar{y}^*) = \bar{x}^0\Sigma_b\bar{x}^{0\prime}.$$

The sample estimator of $\sigma^2(\bar{y}^*)$ is

$$s^2(\bar{y}^*) = \bar{x}^0(Q'Q)^{-1}\bar{x}^{0\prime}s^2.$$


Under the assumptions of the model the statistic

$$t = (\bar{x}^0 b - \bar{x}^0\beta)/s(\bar{y}^*)$$

has a Student's t distribution with n-p degrees of freedom.
Consequently, the confidence interval for the expected value of the
average of the regression estimates is obtained from the probability
statement

$$P(\bar{x}^0 b - ts(\bar{y}^*) < \bar{x}^0\beta < \bar{x}^0 b + ts(\bar{y}^*)) = 1-\alpha \tag{4}$$

where $t = t_{1-\alpha/2,\,n-p}$.

We are also interested in a prediction interval for the future mean
$\bar{y}^0$. The prediction interval gives, on a probability basis, the
range of error of the future mean. Let $d = \bar{y}^0 - \bar{y}^*$. The
statistic $d$ is normally distributed with mean and variance

$$E(d) = E(\bar{y}^0) - E(\bar{y}^*) = \bar{x}^0\beta - \bar{x}^0\beta = 0$$

and

$$\sigma^2(d) = \sigma^2(\bar{v}/k + \bar{x}^0(Q'Q)^{-1}\bar{x}^{0\prime}).$$

The sample estimator of $\sigma^2(d)$ is

$$s^2(d) = s^2(\bar{v}/k + \bar{x}^0(Q'Q)^{-1}\bar{x}^{0\prime}).$$

Note that the statistics $\bar{y}^0$ and $\bar{x}^0 b$ are statistically
independent since they are based on independent samples.
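
Independence is what allows the variance of the difference to be written
as a sum. Spelled out with the variances of the future mean and of the
averaged regression estimate, the step reads as follows (a restatement
of the expressions in this section, with $\bar{v}$ denoting the average
variance factor of the future observations):

```latex
\begin{aligned}
\sigma^2(d) &= \sigma^2(\bar{y}^0) + \sigma^2(\bar{y}^*) \\
            &= \frac{\bar{v}\sigma^2}{k} + \bar{x}^0 (Q'Q)^{-1} \bar{x}^{0\prime}\sigma^2 \\
            &= \sigma^2\left(\frac{\bar{v}}{k} + \bar{x}^0 (Q'Q)^{-1} \bar{x}^{0\prime}\right).
\end{aligned}
```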

It follows from the assumptions that the quantity $d/s_d$ has a
Student's t distribution with n-p degrees of freedom. Therefore the
prediction interval for the future mean $\bar{y}^0$ given
$(x_1^0, x_2^0, \dots, x_k^0)$ can be calculated from the probability
statement

$$P(\bar{x}^0 b - ts_d < \bar{y}^0 < \bar{x}^0 b + ts_d) = 1-\alpha. \tag{5}$$
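
Equations (4) and (5) differ only in the variance term, which a short
routine makes explicit. The sketch below is illustrative only: the
function names, the two-parameter model, the matrix entries, and the
value t = 2.0 are all hypothetical, and the t quantile is assumed to be
looked up by the caller.

```python
import math

def quad_form(x, A):
    """Return x A x' for a row vector x and a square matrix A."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))

def mean_intervals(x_future, v_future, b, QtQ_inv, s2, t):
    """Confidence interval (4) and prediction interval (5) for a future sample.

    x_future : list of k row vectors x_j^0
    v_future : list of k variance factors v_j
    t        : quantile t_{1-alpha/2, n-p}, supplied by the caller
    """
    k, p = len(x_future), len(b)
    x_bar = [sum(x[a] for x in x_future) / k for a in range(p)]  # average vector
    v_bar = sum(v_future) / k                                    # average variance factor
    estimate = sum(x_bar[a] * b[a] for a in range(p))            # x-bar^0 b
    var_fit = quad_form(x_bar, QtQ_inv) * s2                     # s^2(y-bar*)
    var_d = s2 * v_bar / k + var_fit                             # s^2(d)
    ci = (estimate - t * math.sqrt(var_fit), estimate + t * math.sqrt(var_fit))
    pi = (estimate - t * math.sqrt(var_d), estimate + t * math.sqrt(var_d))
    return estimate, ci, pi

# Hypothetical inputs for a straight-line model y = b0 + b1*x with v_j = x_j^2
b = [2.0, 3.0]
QtQ_inv = [[0.5, -0.1], [-0.1, 0.04]]
s2 = 0.01
xs = [1.0, 2.0, 4.0]
x_future = [[1.0, x] for x in xs]
v_future = [x ** 2 for x in xs]

estimate, ci, pi = mean_intervals(x_future, v_future, b, QtQ_inv, s2, t=2.0)
print("estimate:", estimate)
print("confidence interval:", ci)
print("prediction interval:", pi)
```

Both intervals are centered on the same point estimate; the prediction
interval is always the wider of the two because it adds the variance of
the future mean to the variance of the regression estimate.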

Several examples of situations where these types of intervals arise
follow.

I. In regression analysis the confidence intervals for the expected
value of y given $x^0$ can be calculated for several values of $x^0$.
The upper and lower confidence limits are often plotted about the
estimated regression line. It is also common practice to plot the
prediction interval for one future value given $x^0$. In this case
k = 1, and the quantities $\bar{x}^0$ and $\bar{v}$ in the interval
estimators are replaced by $x^0$ and $v$, respectively. The vector $x^0$
is a single vector of specific values, say
$(x_1^0, x_2^0, \dots, x_p^0)$, and $v$ is the weight associated with
$x^0$.

II. Consider a population of a large number (N) of trees where the trees
are measurable for volume (y) and diameter at breast height (d). A
random sample of n trees is measured and a parabolic regression
$E(y) = \alpha + \beta d + \gamma d^2$ is estimated by weighted least
squares. Suppose that sometime in the near future, k trees of
preselected diameters are to be sampled from the forest. The sample will
consist of $k_1$ trees of diameter $d_1$, $k_2$ trees of diameter $d_2$,
..., $k_s$ trees of diameter $d_s$. Also assume that the size of the
future sample k is small relative to the number of trees in the
population N.

We are interested in

(1) A point estimator of the average volume $\bar{y}^*$. The estimator
is

$$\bar{x}^0 b = (1, \bar{d}, \overline{d^2})(\hat{\alpha}, \hat{\beta}, \hat{\gamma})'$$

where $\bar{d} = \sum_{j=1}^{s} k_j d_j/k$,
$\overline{d^2} = \sum_{j=1}^{s} k_j d_j^2/k$, and $\hat{\alpha}$,
$\hat{\beta}$, and $\hat{\gamma}$ are the weighted least squares
estimates of $\alpha$, $\beta$, and $\gamma$.

(2) The $1-\alpha$ confidence interval for the expected value of
$\bar{y}^*$, which is obtained from equation (4).

(3) The $1-\alpha$ prediction interval for the future mean
($\bar{y}^0$), which is obtained from equation (5).

III. Consider the case where an entire forest is harvested. All values
of the vector x are measured. Let u be the mean of all vectors $x_j$. A
small random sample of observations (y, x) is measured and the weighted
parabolic regression is estimated.

(1) The multiple regression estimator of the average volume is $ub$.

(2) The confidence interval for $u\beta$ is given by

$$ub \pm t_{1-\alpha/2,\,n-p}\,[u(Q'Q)^{-1}u's^2]^{1/2}.$$

IV. It may be too expensive to measure the diameter of all the trees.
Instead, the diameter is measured on a second large independent sample.
The double sampling estimator of $u\beta$ is $\bar{x}_d b$ where
$\bar{x}_d$ is the mean vector from the second sample. Equation (5) is
not the proper expression for the confidence interval for
$\bar{x}_d\beta$ because $\bar{x}_d$ is a random vector. An
approximation of $\sigma^2(\bar{x}_d b)$ is given by Sen (1973).

EXAMPLES 

Cunia (1964) found a curvilinear relationship between the volume and
diameter in black spruce. The relationship can be written

$$y_i = \alpha + \beta d_i + \gamma d_i^2 + e_i \tag{6}$$

where $y_i$ is the volume and $d_i$ is the diameter of the ith tree. He
also found that the variance of the volume can reasonably be assumed to
be proportional to the fourth power of the diameter. Assuming the errors
are independent, the variance-covariance matrix is

$$V\sigma^2 = \begin{bmatrix} d_1^4 & & 0 \\ & \ddots & \\ 0 & & d_n^4 \end{bmatrix}\sigma^2.$$

The matrices $P$ and $P^{-1}$ are

$$P = \begin{bmatrix} d_1^2 & & & 0 \\ & d_2^2 & & \\ & & \ddots & \\ 0 & & & d_n^2 \end{bmatrix}
\quad\text{and}\quad
P^{-1} = \begin{bmatrix} d_1^{-2} & & 0 \\ & \ddots & \\ 0 & & d_n^{-2} \end{bmatrix}.$$

Premultiplying the set of observations by $P^{-1}$ results in a set of
equations whose ith row is

$$y_i/d_i^2 = \alpha(1/d_i^2) + \beta(1/d_i) + \gamma(1) + e_i/d_i^2 \tag{7}$$

or

$$z_i = \alpha q_{1i} + \beta q_{2i} + \gamma q_{3i} + f_i.$$

The first step in the weighted regression analysis is to transform the
data. Cunia (1964: Table 3) gives diameters and volumes for 25 black
spruce. The original measurements have the form

Tree No.    Diameter (d)    Volume (y)
            (inches)        (cubic ft.)
    1           3.9             1.0
    2           4.1             1.6
    .            .               .
   25          12.7            25.4

The transformed values needed for the weighted multiple regression
analysis are

Tree No.    q1 = 1/d^2    q2 = 1/d    q3 = 1.0    z = y/d^2
    1        .065740       .2564        1.0         .0657
    2        .059487       .2439        1.0         .0952
    .           .            .           .            .
   25        .006193       .0787        1.0         .1575

The transformed data can be analyzed with any multiple regression
program. Most regression programs print $b$, $s^2$, and $(Q'Q)^{-1}$.
Computer programs with the option for testing hypotheses of the form
$x\beta = d$ should also print the quantities $xb$ and $x(Q'Q)^{-1}x'$.

An ordinary least squares analysis of the transformed values was done
with BIOMEDX63.* The statistics of interest are

$$b' = (\hat{\alpha}, \hat{\beta}, \hat{\gamma}) = (1.19040, -0.76579, 0.19638),$$

$$(Q'Q)^{-1} = \begin{bmatrix} 6218.83 & -2024.18 & 145.87 \\ -2024.18 & 672.12 & -49.59 \\ 145.87 & -49.57 & 3.79 \end{bmatrix},$$

and

$$s^2 = 0.0053/22 = 0.000241.$$

Confidence Intervals for $x\beta$ and Prediction Intervals for
one future observation $y^0$

Suppose we want the expected volume and interval estimates for a 10 inch
diameter black spruce. The estimate is $xb = (1, 10, 100)b = 13.17070$
cubic feet. To compute the .95 confidence interval we also need

$$x(Q'Q)^{-1}x' = 912.12102$$

and $t_{.975,\,22} = 2.074$.

Then from equation (4) we have

$$P(12.20 < \alpha + 10\beta + 100\gamma < 14.14) = .95.$$

To compute the prediction interval for the volume of a single future
observation for a 10 inch black spruce we need

$$v/k = d^4/1 = 10^4.$$

The prediction interval is

$$P(9.81 < y^0 < 16.53) = .95.$$
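
The arithmetic behind both intervals can be retraced from the quantities
reported above. The sketch below takes b, s^2, x(Q'Q)^{-1}x', and the t
value directly from the text; the variable names are hypothetical. With
the rounded coefficients the point estimate comes out as 13.1705 rather
than the 13.17070 reported, which does not change the intervals at two
decimals.

```python
import math

# Values reported in the text
b = (1.19040, -0.76579, 0.19638)   # weighted least squares estimates
s2 = 0.000241                      # sample variance
x_QtQ_inv_x = 912.12102            # x (Q'Q)^{-1} x' for x = (1, 10, 100)
t = 2.074                          # t_{.975, 22}
diameter = 10.0

x = (1.0, diameter, diameter ** 2)
estimate = sum(xi * bi for xi, bi in zip(x, b))

# Confidence interval for the expected volume, equation (4)
s_fit = math.sqrt(x_QtQ_inv_x * s2)
ci = (estimate - t * s_fit, estimate + t * s_fit)

# Prediction interval for one future tree (k = 1, v = d^4), equation (5)
v, k = diameter ** 4, 1
s_d = math.sqrt(s2 * (v / k + x_QtQ_inv_x))
pi = (estimate - t * s_d, estimate + t * s_d)

print(f"estimate = {estimate:.4f}")
print(f"confidence interval = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"prediction interval = ({pi[0]:.2f}, {pi[1]:.2f})")
```

The prediction interval is much wider than the confidence interval
because the term v/k = 10^4 dominates x(Q'Q)^{-1}x' = 912.12102 inside
the square root.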


* See BIOMEDX63, multivariate general linear hypothesis. Univ. Cal.
Publ. in Autom. Comput. 3. W. J. Dixon, Editor. Univ. Cal. Press. 1969.

The sample regression line, confidence intervals, and prediction
intervals for one future value are shown in Fig. 1. Note the increasing
width of the confidence and prediction intervals with increasing
diameter. The flare in the intervals is the result of the assumption
that the variance of the volume is proportional to the fourth power of
the diameter.


Figure 1. Weighted regression line, 0.95 confidence intervals, and 0.95
prediction intervals based on the sample of 25 black spruce.
[Plot not reproduced. Fitted curve:
$\hat{y} = 1.19040 - 0.76579d + 0.19638d^2$; curves shown: 0.95
confidence limits and 0.95 prediction limits; horizontal axis: diameter
(in inches).]


Multiple Regression Estimate

Cunia (1964: Table 2) gives the following diameter distribution for 1188
black spruce.

Diameter    Number
    4         156
    5         321
    6         265
    7         130
    8         146
    9          84
   10          19
   11.5        51
   13.5        12
   15.5         4

The mean vector is $u = (1, \bar{d}, \overline{d^2}) = (1, 6.44, 45.77)$.
The estimated mean volume for this population of trees is 5.247 cubic
feet per tree. The variance of $ub$ is $u(Q'Q)^{-1}u's^2 = 0.023208$,
and the 0.95 confidence interval for $u\beta$ is

$$5.247 \pm 2.074(.023208)^{1/2} = 5.247 \pm 0.315957.$$
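
The mean vector and the interval half-width can be recomputed from the
diameter distribution. The sketch below takes the coefficients, the
variance of ub, and the t value from the text; the variable names are
hypothetical. Using the unrounded mean vector gives an estimate slightly
below 5.247, which is the value obtained from the rounded vector
(1, 6.44, 45.77).

```python
import math

# Diameter distribution quoted in the text: (diameter in inches, number of trees)
distribution = [(4, 156), (5, 321), (6, 265), (7, 130), (8, 146),
                (9, 84), (10, 19), (11.5, 51), (13.5, 12), (15.5, 4)]

n_trees = sum(count for _, count in distribution)
d_bar = sum(d * count for d, count in distribution) / n_trees        # mean diameter
d2_bar = sum(d ** 2 * count for d, count in distribution) / n_trees  # mean squared diameter
u = (1.0, d_bar, d2_bar)

b = (1.19040, -0.76579, 0.19638)   # estimates reported in the text
var_ub = 0.023208                  # u (Q'Q)^{-1} u' s^2 as reported in the text
t = 2.074                          # t_{.975, 22}

estimate = sum(ui * bi for ui, bi in zip(u, b))
half_width = t * math.sqrt(var_ub)

print(f"trees = {n_trees}, u = (1, {d_bar:.2f}, {d2_bar:.2f})")
print(f"estimated mean volume = {estimate:.3f} cubic feet per tree")
print(f"0.95 half-width = {half_width:.6f}")
```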

The examples show that weighted regression is no more difficult than
ordinary least squares analysis. Interval estimates are easily obtained
from an ordinary multiple regression analysis of the transformed data.
Special computer programs are not needed.